expectation suite
How to Test PySpark ETL Data Pipeline
"Garbage in, garbage out" is a common expression used to emphasize the importance of data quality for tasks such as machine learning, data analytics and business intelligence. With an increasing amount of data being created and stored, building high-quality data pipelines has never been more challenging. PySpark is a commonly used tool for building ETL pipelines for large datasets. A common question that arises while building a data pipeline is "How do we know that our data pipeline is transforming the data in the way that is intended?". To answer this question, we borrow the idea of unit testing from the software development paradigm.
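As a rough illustration of the unit-testing idea, here is a minimal sketch. It assumes the row-level logic of a transformation is factored into plain Python functions so it can be tested without a Spark session. The names `clean_order` and `transform` and the column names are hypothetical and are not taken from the article.

```python
import unittest


def clean_order(row: dict) -> dict:
    """Hypothetical row-level transformation: normalise the country code
    and derive an order total from quantity and unit price."""
    return {
        "order_id": row["order_id"],
        "country": row["country"].strip().upper(),
        "total": row["quantity"] * row["unit_price"],
    }


def transform(rows: list) -> list:
    """Hypothetical ETL step: drop rows with a non-positive quantity,
    then clean the remaining rows."""
    return [clean_order(r) for r in rows if r["quantity"] > 0]


class TransformTest(unittest.TestCase):
    def test_transform_cleans_and_filters(self):
        # Small, hand-written input that covers both branches of the step.
        source = [
            {"order_id": 1, "country": " us ", "quantity": 4, "unit_price": 2.5},
            {"order_id": 2, "country": "de", "quantity": 0, "unit_price": 5.0},
        ]
        # The expected output is written by hand, not computed by the code under test.
        expected = [{"order_id": 1, "country": "US", "total": 10.0}]
        self.assertEqual(transform(source), expected)

    def test_transform_handles_empty_input(self):
        self.assertEqual(transform([]), [])
```

Saved as a module, the tests can be run with `python -m unittest`. In an actual PySpark pipeline the same pattern would presumably apply at the DataFrame level: build a tiny input DataFrame in a local Spark session, run the transformation, and compare the collected rows with a hand-written expected result.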
Reducing Pipeline Debt With Great Expectations
This article was first published on Neptune AI's blog. You are part of a data science team at a product company. Your team has a number of machine learning models in place. Their outputs guide critical business decisions, as well as a couple of dashboards displaying important KPIs that are closely watched by your executives day and night. On that fateful day, you had just brewed yourself a cup of coffee and were about to begin your workday when the universe collapsed. Everyone at the company went crazy. The business metrics dashboard was displaying what seemed to be random numbers (except at every full hour, when the KPIs looked okay for a short time) and the models were predicting that the company's insolvency was looming fast. What is worse, every attempt to resolve this madness resulted in your data engineering and research teams reporting new broken services and models. That was debt collection day, and the unpaid debt was of the worst kind: pipeline debt.